Representation Policy Iteration
This paper addresses a fundamental issue central to approximation methods for
solving large Markov decision processes (MDPs): how to automatically learn the
underlying representation for value function approximation? A novel
theoretically rigorous framework is proposed that automatically generates
geometrically customized orthonormal sets of basis functions, which can be used
with any approximate MDP solver like least squares policy iteration (LSPI). The
key innovation is a coordinate-free representation of value functions, using
the theory of smooth functions on a Riemannian manifold. Hodge theory yields a
constructive method for generating basis functions for approximating value
functions based on the eigenfunctions of the self-adjoint (Laplace-Beltrami)
operator on manifolds. In effect, this approach performs a global Fourier
analysis on the state space graph to approximate value functions, where the
basis functions reflect the large-scale topology of the underlying state space.
A new class of algorithms called Representation Policy Iteration (RPI) is
presented that automatically learns both basis functions and approximately
optimal policies. Illustrative experiments compare the performance of RPI with
that of LSPI using two hand-coded basis functions (RBF and polynomial state
encodings).
Comment: Appears in Proceedings of the Twenty-First Conference on Uncertainty
in Artificial Intelligence (UAI 2005).
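As a rough illustration of the basis-construction step (a sketch only, not the authors' implementation), the snippet below builds the graph Laplacian of a small grid-world state graph and takes its lowest-order eigenvectors as basis functions; the grid size, the number of basis functions, and the stubbed-out weight fit are all arbitrary choices.

```python
import numpy as np
import networkx as nx

# State-space graph of a small grid world (an assumed topology, not the
# paper's benchmark): states are nodes, neighboring cells are edges.
G = nx.grid_2d_graph(10, 10)
A = nx.to_numpy_array(G)          # adjacency matrix
L = np.diag(A.sum(axis=1)) - A    # combinatorial graph Laplacian (self-adjoint)

# Eigenvectors of the Laplacian, ordered by eigenvalue, act as a global
# "Fourier" basis reflecting the topology of the state-space graph.
eigvals, eigvecs = np.linalg.eigh(L)
k = 20                            # number of basis functions (a free parameter)
phi = eigvecs[:, :k]              # n_states x k basis matrix

# A value function is then approximated as V ~ phi @ w; inside RPI the
# weights w would be fit by an approximate solver such as LSPI.
w = np.random.default_rng(0).normal(size=k)   # placeholder weights, illustration only
V_approx = phi @ w
print(V_approx.shape)             # (100,)
```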
Manifold Alignment using Procrustes Analysis
In this paper we introduce a novel approach to manifold alignment, based on Procrustes analysis. Our approach differs from semi-supervised alignment in that it results in a mapping that is defined everywhere (when used with a suitable dimensionality reduction method) rather than just on the training data points. We describe and evaluate our approach both theoretically and experimentally, providing results showing useful knowledge transfer from one domain to another. Novel applications of our method, including cross-lingual information retrieval and transfer learning in Markov decision processes, are presented.
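A minimal sketch of the Procrustes step, assuming both domains have already been embedded in a common low-dimensional space and that the points passed in are known correspondences; all names and sizes below are illustrative, not from the paper.

```python
import numpy as np

def procrustes_align(X, Y):
    """Find scale k and rotation Q so that k * (Y - mean) @ Q approximates X - mean.

    X, Y: (l, d) arrays of corresponding points from the low-dimensional
    embeddings of the two domains.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Optimal rotation from the SVD of Y^T X (orthogonal Procrustes problem).
    U, S, Vt = np.linalg.svd(Yc.T @ Xc)
    Q = U @ Vt
    k = S.sum() / (Yc ** 2).sum()     # optimal isotropic scaling
    return k, Q

# Toy usage: domain 2 is a rotated and scaled copy of domain 1.
rng = np.random.default_rng(0)
X_corr = rng.normal(size=(50, 3))                 # correspondences, domain 1
R = np.linalg.qr(rng.normal(size=(3, 3)))[0]      # random orthogonal matrix
Y_corr = 2.0 * X_corr @ R                         # correspondences, domain 2

k, Q = procrustes_align(X_corr, Y_corr)

# Because the mapping is defined everywhere, *new* domain-2 points (not just
# the training correspondences) can be carried into domain 1's coordinates.
Y_new = rng.normal(size=(200, 3))
Y_mapped = k * (Y_new - Y_corr.mean(axis=0)) @ Q + X_corr.mean(axis=0)
```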
Randomized and Deterministic Attention Sparsification Algorithms for Over-parameterized Feature Dimension
Large language models (LLMs) have shown their power in different areas.
Attention computation, as an important subroutine of LLMs, has also attracted
interest in theory. Recently, the static computation and dynamic maintenance of
the attention matrix have been studied by [Alman and Song 2023] and [Brand, Song
and Zhou 2023] from both the algorithmic and the hardness perspectives. In this
work, we consider the sparsification of the attention problem. We make one
simplifying assumption: the logit matrix is symmetric. Let $n$ denote the length
of the sentence and let $d$ denote the embedding dimension. Given a matrix
$X \in \mathbb{R}^{n \times d}$, suppose $d \gg n$ and
$\| X X^\top \|_{\infty} \leq r$ with $r \in (0, 0.1)$; then we aim to find
$Y \in \mathbb{R}^{n \times m}$ (where $m \ll d$) such that
\begin{align*} \| D(Y)^{-1} \exp( Y Y^\top ) -
D(X)^{-1} \exp( X X^\top) \|_{\infty} \leq O(r). \end{align*} We provide two
results for this problem.
Our first result is a randomized algorithm. It runs in
$\widetilde{O}(\mathrm{nnz}(X) + n^{\omega})$ time, has $1 - \delta$ success
probability, and chooses $m = O(n \log(n/\delta))$. Here $\mathrm{nnz}(X)$
denotes the number of non-zero entries in $X$. We use $\omega$ to denote the
exponent of matrix multiplication. Currently $\omega \approx 2.373$.
Our second result is a deterministic algorithm. It runs in
$O(\min\{ \sum_{i \in [d]} \mathrm{nnz}(X_i)^2, \; d n^{\omega - 1} \} + n^{\omega + 1})$
time and chooses $m = O(n)$. Here $X_i$ denotes the $i$-th column
of matrix $X$.
Our main findings have the following implication for applied LLM tasks: for
any super-large feature dimension, we can reduce it down to a size nearly
linear in the length of the sentence.
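The flavor of this guarantee can be checked numerically with a generic random Gaussian projection standing in for the paper's sparsification procedures; the sketch below is only an illustration under that assumption, not the randomized or deterministic algorithm analyzed above.

```python
import numpy as np

def attention(Z):
    """Row-normalized entrywise exponential D(Z)^{-1} exp(Z Z^T)."""
    A = np.exp(Z @ Z.T)
    return A / A.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n, d = 32, 4096                                   # short sentence, very large feature dimension
X = 0.3 * rng.normal(size=(n, d)) / np.sqrt(d)    # keeps the entries of X X^T small

# Reduce the feature dimension with a random Gaussian projection; m is
# nearly linear in n, as in the statement above (the projection here is a
# generic stand-in, not the paper's construction).
m = 8 * n
S = rng.normal(size=(d, m)) / np.sqrt(m)
Y = X @ S                                         # Y in R^{n x m} with m << d

err = np.abs(attention(Y) - attention(X)).max()
print(f"entrywise error: {err:.4f}")              # small once m is large enough
```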
An Over-parameterized Exponential Regression
Over the past few years, there has been a significant amount of research
focused on studying the ReLU activation function, with the aim of achieving
neural network convergence through over-parametrization. However, recent
developments in the field of Large Language Models (LLMs) have sparked interest
in the use of exponential activation functions, specifically in the attention
mechanism.
Mathematically, we define the neural function
$F : \mathbb{R}^{d \times m} \times \mathbb{R}^{d} \to \mathbb{R}$ using an exponential
activation function. We are given a set of data points with labels
$\{ (x_1, y_1), \dots, (x_n, y_n) \} \subset \mathbb{R}^{d} \times \mathbb{R}$, where $n$
denotes the number of data points. Here $F(W(t), x)$ can be expressed as
$F(W(t), x) := \sum_{r=1}^{m} a_r \exp( \langle w_r(t), x \rangle )$, where $m$ represents
the number of neurons and $w_r(t)$ are the weights at time $t$. It is standard in the
literature that the $a_r$ are fixed weights that are never changed during training. We
initialize the weights $W(0) \in \mathbb{R}^{d \times m}$ with random Gaussian
distributions, such that $w_r(0) \sim \mathcal{N}(0, I_d)$, and initialize $a_r$
from a random sign distribution for each $r \in [m]$.
Using the gradient descent algorithm, we can find a weight $W(T)$ such that
$\| F(W(T)) - y \|_2 \leq \epsilon$ holds with probability $1 - \delta$, where
$\epsilon \in (0, 0.1)$ and $m = \Omega(n^{2+o(1)} \log(n/\delta))$. To optimize
the over-parameterization bound $m$, we employ several tight analysis
techniques from previous studies [Song and Yang arXiv 2019; Munteanu, Omlor,
Song and Woodruff ICML 2022].
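A toy version of this training setup, using a small synthetic dataset and plain full-batch gradient descent; the sizes, learning rate, and the 1/sqrt(m) output scaling are illustrative choices, not the bounds or normalization from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 50, 10, 1000                         # data points, input dimension, neurons (toy sizes)

X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)  # unit-norm inputs
y = rng.normal(size=n)

W = rng.normal(size=(d, m))                    # w_r(0) ~ N(0, I_d)
a = rng.choice([-1.0, 1.0], size=m)            # a_r from a random sign distribution, kept fixed

def F(W, X):
    """F(W, x) = sum_r a_r exp(<w_r, x>) at every data point.

    The 1/sqrt(m) output scaling is an extra normalization added here for
    numerical convenience; it is not part of the statement above.
    """
    return np.exp(X @ W) @ a / np.sqrt(m)

lr = 1e-3
for _ in range(2000):
    resid = F(W, X) - y                        # shape (n,)
    # Gradient of 0.5 * ||F(W, X) - y||_2^2 with respect to W (only the w_r move).
    grad = X.T @ (resid[:, None] * np.exp(X @ W) * a / np.sqrt(m))
    W -= lr * grad

print(np.linalg.norm(F(W, X) - y))             # training error shrinks as m and the step count grow
```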